Skip to content

fix: use buffered and script based get-reads implementation #126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
May 22, 2025

Conversation

johanneskoester
Copy link
Contributor

@johanneskoester johanneskoester commented May 20, 2025

Summary by CodeRabbit

  • New Features

    • Added a new script to extract paired-end reads from BAM files and output them as FASTQ files.
  • Improvements

    • Updated environment dependencies to include an additional required package.
    • Enhanced the read extraction process for improved accuracy and maintainability.
    • Updated configuration comments to reflect a higher optional read limit.
  • Refactor

    • Replaced previous shell-based extraction with a dedicated Python script.
    • Simplified configuration and resource handling for the read extraction rule.
    • Updated tool wrapper versions for improved compatibility.
    • Removed deprecated read limit function to streamline processing.

Copy link
Contributor

coderabbitai bot commented May 20, 2025

"""

Walkthrough

The environment configuration was updated to add the dnaio dependency. The get_read_limit_param function was removed from the workflow. The get_reads rule was refactored to use a new Python script, get-reads.py, for extracting paired-end reads from BAM files, replacing the previous shell-based approach and updating the conda environment. The samtools_faidx and index_bcf rules were updated to use newer wrapper versions. The configuration comment for limit-reads was updated to reflect a higher read limit.

Changes

File(s) Change Summary
workflow/envs/pysam.yaml Added dnaio=1.2 to the environment dependencies.
workflow/rules/common.smk Removed the get_read_limit_param function.
workflow/rules/download.smk Refactored the get_reads rule to use get-reads.py script, changed conda env to pysam.yaml, simplified parameters, removed explicit thread resources, and updated samtools_faidx wrapper version.
workflow/scripts/get-reads.py Introduced a new script to extract paired-end reads from a BAM file and write them to FASTQ using pysam and dnaio.
config/config.yaml Updated comment for limit-reads to indicate a 100,000 read limit instead of 10,000; no functional change to the value.
workflow/rules/utils.smk Updated index_bcf rule to use a newer wrapper version (v6.2.0 instead of v1.9.0).

Sequence Diagram(s)

sequenceDiagram
    participant Snakemake
    participant get-reads.py
    participant pysam
    participant dnaio

    Snakemake->>get-reads.py: Run script with BAM input and FASTQ outputs
    get-reads.py->>pysam: Open BAM file for reading
    loop For each alignment
        get-reads.py->>get-reads.py: Buffer reads by query name
        get-reads.py->>get-reads.py: Pair mates when both found
        get-reads.py->>dnaio: Write paired reads to FASTQ
    end
    get-reads.py->>Snakemake: Log progress and completion
Loading

Poem

In the warren, scripts now hop anew,
With dnaio and pysam, the reads flow through.
No more shell commands, just Pythonic delight,
Paired-end friends find their FASTQ flight.
With dependencies set and old code shed,
The workflow leaps ahead!
🐇✨
"""


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2e86719 and 7b2cbcc.

📒 Files selected for processing (1)
  • workflow/rules/utils.smk (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Testing
🔇 Additional comments (1)
workflow/rules/utils.smk (1)

20-20: Appropriate wrapper version update

Updating the bcftools index wrapper from v1.9.0 to v6.2.0 aligns with the other wrapper standardization changes in this PR. This significant version jump should provide better compatibility with newer tools and is consistent with the samtools_faidx wrapper update mentioned in the summary.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
workflow/envs/pysam.yaml (1)

9-9: Add a newline at the end of the file.

The file is missing a newline character at the end, which is considered a best practice and required by some tools.

  - dnaio =1.2
+ - dnaio =1.2
+
🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

workflow/scripts/get-reads.py (2)

3-3: Use a context manager for opening the log file.

Using a context manager ensures the file is properly closed even if exceptions occur.

-sys.stderr = open(snakemake.log[0], "w")
+with open(snakemake.log[0], "w") as log_file:
+    sys.stderr = log_file
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


35-37: Make the read1/read2 ordering more robust.

The current code only checks if the alignment is read2, but doesn't verify the other alignment is read1. This could lead to incorrect pairing in some edge cases.

-            if aln.is_read2:
+            if aln.is_read2 and mate_aln.is_read1:
                 aln, mate_aln = mate_aln, aln
+            elif aln.is_read1 and mate_aln.is_read2:
+                # Already in correct order
+                pass
+            else:
+                # Handle unexpected read flags
+                continue
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc4dca and b415c7c.

📒 Files selected for processing (4)
  • workflow/envs/pysam.yaml (1 hunks)
  • workflow/rules/common.smk (0 hunks)
  • workflow/rules/download.smk (1 hunks)
  • workflow/scripts/get-reads.py (1 hunks)
💤 Files with no reviewable changes (1)
  • workflow/rules/common.smk
🧰 Additional context used
🪛 YAMLlint (1.37.1)
workflow/envs/pysam.yaml

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)

🔇 Additional comments (3)
workflow/envs/pysam.yaml (1)

9-9: Addition of dnaio looks good!

The dnaio package is correctly added to support the new Python script-based approach for handling FASTQ files.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

workflow/rules/download.smk (2)

6-6: Parameter handling looks good.

The branch function effectively replaces the previous get_read_limit_param approach, providing a cleaner way to conditionally set the read limit.


11-14: Good refactoring to use a Python script instead of shell commands.

Switching from a shell-based approach to a dedicated Python script improves maintainability and reliability for handling paired reads. The conda environment has been appropriately updated to include the required dependencies.

johanneskoester and others added 2 commits May 20, 2025 10:09
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
workflow/scripts/get-reads.py (4)

1-7: Move imports to the top of the file.

Following Python best practices (PEP 8), all imports should be placed at the top of the file before any other code. The current placement interrupts the import section with stderr redirection.

 import sys
+import pysam
+import dnaio
 
 sys.stderr = open(snakemake.log[0], "w")
-
-import pysam
-import dnaio
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


3-3: Use a context manager for the log file.

The current approach of directly reassigning sys.stderr doesn't ensure the file is properly closed when the script ends. Using a context manager or at least closing the file at the end of the script is recommended.

-sys.stderr = open(snakemake.log[0], "w")
+# Option 1: Redirect within a try-finally to ensure closing
+log_file = open(snakemake.log[0], "w")
+sys.stderr = log_file
+try:
+    # Rest of your code would be indented here
+finally:
+    log_file.close()
+
+# Option 2: Use a custom context manager for stderr redirection
+class RedirectStderr:
+    def __init__(self, file_path):
+        self.file_path = file_path
+        self.original_stderr = None
+        self.file_obj = None
+    
+    def __enter__(self):
+        self.file_obj = open(self.file_path, "w")
+        self.original_stderr = sys.stderr
+        sys.stderr = self.file_obj
+        return self.file_obj
+    
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        sys.stderr = self.original_stderr
+        if self.file_obj:
+            self.file_obj.close()
+
+with RedirectStderr(snakemake.log[0]):
+    # Rest of your code would be here
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


38-40: Add orphaned read handling at end of file processing.

Currently, any reads in the buffer without mates at the end of processing are silently ignored. Consider handling these orphaned reads by either logging them or optionally writing them as single-end reads.

            fqwriter.write(aln_to_fq(aln), aln_to_fq(mate_aln))
            n_written += 1
+
+# Add this at the end, after the loop completes
+if buffer:
+    print(f"Warning: {len(buffer)} unpaired reads found in BAM file", file=sys.stderr)
+    # Optionally, write unpaired reads as single-end if needed
+    # for orphaned_aln in buffer.values():
+    #     fqwriter.write(aln_to_fq(orphaned_aln), None)

40-40: Explicitly close the BAM file.

The BAM file should be explicitly closed after processing to ensure proper resource cleanup, especially for large files.

            fqwriter.write(aln_to_fq(aln), aln_to_fq(mate_aln))
            n_written += 1
+
+# Close the BAM file after processing
+bam.close()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between b415c7c and 27bad31.

📒 Files selected for processing (2)
  • workflow/rules/download.smk (1 hunks)
  • workflow/scripts/get-reads.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/rules/download.smk
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)

🔇 Additional comments (2)
workflow/scripts/get-reads.py (2)

24-25: LGTM!

The condition to check if limit is None before comparing has been properly implemented, addressing a previous issue.


26-27: LGTM!

The condition to skip secondary and supplementary alignments is now correct, addressing a previous issue.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
workflow/scripts/get-reads.py (3)

9-14: Handle potential None quality scores in aln_to_fq function.

The function doesn't handle the case where aln.query_qualities could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.

 def aln_to_fq(aln):
+    # Handle case when query_qualities is None
+    qualities = aln.query_qualities
+    if qualities is None:
+        qualities = [0] * len(aln.get_forward_sequence())
+    
     return dnaio.SequenceRecord(
         name=aln.query_name,
         sequence=aln.get_forward_sequence(),
-        quality=aln.query_qualities,
+        quality=qualities,
     )

17-18: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.

 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


29-33: Fix inconsistent query name usage in buffer handling.

The code computes qname by removing suffixes but then doesn't use it for the buffer lookup or storage, which could lead to missing mate pairs if the suffixes are inconsistently present.

-        qname = aln.query_name.removesuffix("/1").removesuffix("/2")
+        # Remove potential read suffixes for consistent mate pairing
+        qname = aln.query_name.removesuffix("/1").removesuffix("/2")
 
-        mate_aln = buffer.get(aln.query_name)
+        mate_aln = buffer.get(qname)
         if mate_aln is None:
-            buffer[aln.query_name] = aln
+            buffer[qname] = aln
🧹 Nitpick comments (3)
workflow/scripts/get-reads.py (3)

3-3: Use a context manager for file operations.

This line opens a file for stderr redirection but doesn't handle closing it properly. Using a context manager ensures the file is properly closed even if an exception occurs.

-sys.stderr = open(snakemake.log[0], "w")
+# Use context manager or explicit close to prevent resource leaks
+sys.stderr = open(snakemake.log[0], "w")

Alternatively, consider using a context manager:

import contextlib
with open(snakemake.log[0], "w") as log_file:
    with contextlib.redirect_stderr(log_file):
        # Rest of your code
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


18-18: Close BAM file explicitly after processing.

The script doesn't explicitly close the BAM file after processing, which could lead to resource leaks. Consider using a context manager or explicitly closing the file.

-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+# Use context manager to ensure proper resource cleanup
+with pysam.AlignmentFile(snakemake.params.bam_url) as bam:
+    # Move the processing code inside this context
🧰 Tools
🪛 Ruff (0.11.9)

18-18: Undefined name snakemake

(F821)


1-40: Add docstring and include error handling for the script.

The script lacks a docstring explaining its purpose and usage. Additionally, it could benefit from more comprehensive error handling, especially for the main processing loop.

 import sys
+"""
+Extract paired-end reads from a BAM file and write them to FASTQ files.
+
+This script uses pysam to read alignments from a BAM file and dnaio to write
+paired FASTQ output files. It buffers alignments by query name to pair mates,
+skipping secondary and supplementary alignments.
+
+Input:
+    - BAM file URL (from snakemake.params.bam_url)
+    - Optional read limit (from snakemake.params.limit)
+
+Output:
+    - Paired FASTQ files (from snakemake.output[0] and snakemake.output[1])
+    - Log file (from snakemake.log[0])
+"""

Additionally, consider adding more robust error handling throughout:

try:
    with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
        # Main processing code
except Exception as e:
    print(f"Error processing BAM file or writing FASTQ: {e}", file=sys.stderr)
    sys.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 27bad31 and 7f7a4c6.

📒 Files selected for processing (1)
  • workflow/scripts/get-reads.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)

9-14: Handle potential None quality scores in aln_to_fq function.

The function doesn't handle the case where aln.query_qualities could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.

 def aln_to_fq(qname, aln):
+    # Handle case when query_qualities is None
+    qualities = aln.query_qualities
+    if qualities is None:
+        qualities = [0] * len(aln.get_forward_sequence())
+    
     return dnaio.SequenceRecord(
         name=qname,
         sequence=aln.get_forward_sequence(),
-        qualities=aln.query_qualities,
+        qualities=qualities,
     )

18-18: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.

-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

18-18: Undefined name snakemake

(F821)

🧹 Nitpick comments (1)
workflow/scripts/get-reads.py (1)

3-3: Use a context manager for file handling.

The file is opened without using a context manager, which could lead to resource leaks if the script exits unexpectedly.

-sys.stderr = open(snakemake.log[0], "w")
+with open(snakemake.log[0], "w") as log_file:
+    sys.stderr = log_file
+    
+    # Main script logic should be indented here
+    # Or use a separate function that's called within this context
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7f7a4c6 and e063c93.

📒 Files selected for processing (1)
  • workflow/scripts/get-reads.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)

🔇 Additional comments (2)
workflow/scripts/get-reads.py (2)

24-27: Fixed issues from previous review.

The code now correctly:

  1. Checks if limit is None before comparing with n_written
  2. Uses the correct condition to skip secondary and supplementary alignments

Great job addressing these issues!


29-39: Correct implementation of read pair matching.

The code now properly:

  1. Removes suffixes from query names for consistent matching
  2. Uses the normalized query name for buffer lookup and storage
  3. Ensures correct ordering of read pairs before writing

This implementation effectively handles paired-end reads from BAM files.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
workflow/scripts/get-reads.py (1)

22-42: 🛠️ Refactor suggestion

Implement buffer cleanup mechanism.

The current implementation stores unpaired reads in the buffer but never clears them, which could lead to memory issues with large BAM files containing many unpaired reads.

Consider implementing one of these approaches:

  1. Periodically check and remove old entries from the buffer
  2. Set a maximum buffer size and handle overflow
  3. After processing all alignments, report any unpaired reads left in the buffer
 buffer = {}
 n_written = 0
+buffer_limit = 1000000  # Set a reasonable buffer size limit
+
 with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
     for aln in bam:
         if limit is not None and n_written >= limit:
             break
         if aln.is_secondary or aln.is_supplementary:
             continue
 
         qname = aln.query_name.removesuffix("/1").removesuffix("/2")
 
         mate_aln = buffer.get(qname)
         if mate_aln is None:
             buffer[qname] = aln
+            # Check if buffer size exceeds limit
+            if len(buffer) > buffer_limit:
+                print(f"Warning: Buffer size exceeds {buffer_limit}. Some reads may be unpaired.", 
+                      file=sys.stderr)
         else:
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
 
             fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln))
             n_written += 1
+            # Remove entry from buffer after successful write
+            del buffer[qname]
+
+# Report statistics on unpaired reads
+if buffer:
+    print(f"Warning: {len(buffer)} reads remained unpaired after processing.", file=sys.stderr)
🧰 Tools
🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)

♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)

9-16: Handle potential None quality scores in aln_to_fq function.

The function doesn't handle the case where aln.get_forward_qualities() could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.

 def aln_to_fq(qname, aln):
+    # Handle case when get_forward_qualities returns None
+    qualities = aln.get_forward_qualities()
+    if qualities is None:
+        qualities = [0] * len(aln.get_forward_sequence())
+        quality_str = "!" * len(aln.get_forward_sequence())  # '!' is the lowest quality character
+    else:
+        quality_str = "".join(map(lambda qual: chr(qual + 33), qualities))
+    
     return dnaio.SequenceRecord(
         name=qname,
         sequence=aln.get_forward_sequence(),
-        qualities="".join(
-            map(lambda qual: chr(qual + 33), aln.get_forward_qualities())
-        ),
+        qualities=quality_str,
     )

19-21: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.

+import os
+
 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)

🧹 Nitpick comments (2)
workflow/scripts/get-reads.py (2)

23-27: Consider adding progress reporting.

For large BAM files, processing can take a significant amount of time. Adding progress reporting would help monitor script execution.

 buffer = {}
 n_written = 0
+report_interval = 100000  # Report every 100k read pairs
+processed = 0
+
 with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
     for aln in bam:
+        processed += 1
+        if processed % report_interval == 0:
+            print(f"Processed {processed} alignments, written {n_written} pairs", file=sys.stderr)
+
         if limit is not None and n_written >= limit:
+            print(f"Reached limit of {limit} read pairs. Stopping.", file=sys.stderr)
             break
🧰 Tools
🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)


37-41: Validate read pairing before writing.

The code assumes that if one read is marked as read2, then the other must be read1, but it doesn't explicitly verify this. Adding validation would prevent potential errors.

         else:
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
 
+            # Validate proper read pairing
+            if (aln.is_read1 and mate_aln.is_read2) or (aln.is_read2 and mate_aln.is_read1):
+                # Reads are properly paired
+                pass
+            else:
+                print(f"Warning: Improper read pairing for {qname}. Flags: {aln.flag}, {mate_aln.flag}", 
+                      file=sys.stderr)
+            
             fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln))
             n_written += 1
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between e063c93 and d0820d3.

📒 Files selected for processing (1)
  • workflow/scripts/get-reads.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)

1-7: Improve file handling with context managers and error handling

The current implementation directly assigns to sys.stderr without proper error handling or cleanup mechanism.

 import sys
+import os
 
-sys.stderr = open(snakemake.log[0], "w")
+# Redirect stderr to log file using context manager
+try:
+    sys.stderr = open(snakemake.log[0], "w", buffering=1)
+except (IOError, OSError) as e:
+    print(f"Error opening log file: {e}", file=sys.stderr)
+    sys.exit(1)
 
 import pysam
 import dnaio
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


19-21: Add error handling for BAM file operations

The current code lacks error handling when opening and accessing the BAM file, which could lead to uncaught exceptions.

 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)

🧹 Nitpick comments (3)
workflow/scripts/get-reads.py (3)

24-29: Add progress reporting for long-running operations

For large BAM files, there's no indication of progress, which can make it difficult to monitor the script's execution.

 n_written = 0
+processed = 0
+report_interval = 100000  # Report progress every 100k reads
 with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
     for aln in bam:
+        processed += 1
+        if processed % report_interval == 0:
+            print(f"Processed {processed} alignments, written {n_written} pairs", file=sys.stderr)
         if limit is not None and n_written >= limit:
+            print(f"Reached limit of {limit} pairs, stopping", file=sys.stderr)
             break
         if aln.is_secondary or aln.is_supplementary:
             continue
🧰 Tools
🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)


38-44: Validate read pairs before writing

The current implementation assumes that reads with the same name belong to the same fragment, but additional validation could ensure correct pairing.

         else:
+            # Ensure reads are properly paired
+            if not (aln.is_read1 and mate_aln.is_read2) and not (aln.is_read2 and mate_aln.is_read1):
+                print(f"Warning: Reads with name {qname} don't appear to be proper pairs based on flags", file=sys.stderr)
+            
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
             del buffer[qname]

             fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln))
             n_written += 1

1-3: Import snakemake explicitly for clarity

The script uses the snakemake variable without importing it, which works in Snakemake's execution environment but reduces code clarity.

 import sys
+# The snakemake variable is implicitly available when executed through Snakemake
+# Adding this comment improves clarity for developers unfamiliar with Snakemake
 
 sys.stderr = open(snakemake.log[0], "w")
🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between d0820d3 and 0f7538f.

📒 Files selected for processing (1)
  • workflow/scripts/get-reads.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/scripts/get-reads.py (2)

9-16: 🛠️ Refactor suggestion

Handle potential None qualities in the aln_to_fq function.

The function doesn't handle the case where aln.get_forward_qualities() could be None or empty, which can happen with some BAM files.

 def aln_to_fq(qname, aln):
+    # Handle case when forward qualities might be None
+    qualities = aln.get_forward_qualities()
+    if qualities is None:
+        # Default to quality score of 30 (ASCII '?') for all bases
+        quality_str = "?" * len(aln.get_forward_sequence())
+    else:
+        quality_str = "".join(map(lambda qual: chr(qual + 33), qualities))
+    
     return dnaio.SequenceRecord(
         name=qname,
         sequence=aln.get_forward_sequence(),
-        qualities="".join(
-            map(lambda qual: chr(qual + 33), aln.get_forward_qualities())
-        ),
+        qualities=quality_str,
     )

1-7: 🛠️ Refactor suggestion

Use a context manager for the log file.

The current code directly assigns to sys.stderr without ensuring proper file closure. Also, the script uses snakemake which is implicitly provided in the Snakemake environment but not imported explicitly.

 import sys
 
-sys.stderr = open(snakemake.log[0], "w")
+# Use context manager for log file with line buffering
+log_file = open(snakemake.log[0], "w", buffering=1)
+sys.stderr = log_file
 
 import pysam
 import dnaio

The code should also close the log file at the end of script execution or restore the original stderr.

🧰 Tools
🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)

🧹 Nitpick comments (2)
workflow/scripts/get-reads.py (2)

19-20: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.

 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools
🪛 Ruff (0.11.9)

19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)


24-44: The read pairing logic looks good but consider adding more validation.

The read pairing logic is sound and handles the use case correctly. However, consider adding validation to ensure that paired reads actually belong together (e.g., checking read flags or positions).

         else:
+            # Validate that reads are properly paired
+            if not (aln.is_paired and mate_aln.is_paired):
+                print(f"Warning: Read {qname} has inconsistent paired flags", file=sys.stderr)
+            
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
             del buffer[qname]
🧰 Tools
🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7538f and 11961b9.

📒 Files selected for processing (1)
  • workflow/scripts/get-reads.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Formatting
🔇 Additional comments (1)
workflow/scripts/get-reads.py (1)

46-47: Good handling of orphaned reads.

The code correctly handles orphaned reads by providing a warning message. This is a good improvement from previous iterations.

Copy link
Collaborator

@famosab famosab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Should we check for updates for the wrappers in a seperate PR aswell? I noticed that you updated a few of them :)

@johanneskoester
Copy link
Contributor Author

LGTM! Should we check for updates for the wrappers in a seperate PR aswell? I noticed that you updated a few of them :)

I would say that is not urgent. In particular, it would trigger reruns, which are probably a good thing before re-uploading but not while you still finalize the analysis.

@johanneskoester johanneskoester merged commit b9d4360 into main May 22, 2025
7 checks passed
@johanneskoester johanneskoester deleted the fix/get-reads-buffered branch May 22, 2025 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants